What is the role of media coverage in explaining stock market fluctuations?

Imports

Table of Contents


0. Introduction

Back to table of contents


This notebook is intended to demonstrate the thought process behind this project: identifying patterns and correlations between the stock market and media quotes. In particular, we have focused on the Apple stock, as Apple is one of the most valuable companies on Earth and is widely covered in the media.

1. Loading the datasets

Back to table of contents


In this project we have studied three datasets to provide insights into the stock price evolution, the coverage of Apple in the media, and its relationship with the various speakers. This first section is dedicated to loading the various datasets and preprocessing the relevant information.

1.1 Apple stock: yFinance API

Back to table of contents

The yFinance API is provided by Yahoo Finance and gives easy access to various financial metrics for most stocks on the market. The ticker of the Apple stock is AAPL, and we will first focus on the date range from 2008 to 2020. We also provide an additional indicator of daily activity with the Liquidity field, which is the daily volume multiplied by the average daily stock price. This indicator, expressed in dollars, gives a quick overview of the quantity of Apple stock traded in a day, as a day of high liquidity also tends to be a day of high volatility.
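The Liquidity computation can be sketched as follows. The download call is shown as a comment; for reproducibility, the frame below uses tiny hand-made values in place of the real yFinance output (same column names).

```python
import pandas as pd

# In the notebook the data comes from the yFinance API, e.g.:
#   import yfinance as yf
#   aapl = yf.download("AAPL", start="2008-01-01", end="2020-12-31")
# Here we use a toy frame with the same columns (hypothetical values).
aapl = pd.DataFrame(
    {"High": [28.0, 29.5], "Low": [27.0, 28.5], "Volume": [1_000_000, 2_000_000]},
    index=pd.to_datetime(["2015-01-02", "2015-01-05"]),
)

# Liquidity = daily volume * average daily price (here the High/Low midpoint)
aapl["Liquidity"] = aapl["Volume"] * (aapl["High"] + aapl["Low"]) / 2
```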

1.2 Quotebank Dataset

Back to table of contents

The Quotebank data is a large text corpus of more than 178 million quotations scraped from 337 websites. As we are focusing on Apple-related quotations, we applied several filtering steps to reduce the number of quotes to 310'816. Some of the techniques employed are:

  1. A white list of words that a quote should contain: Apple, iPhone, MacBook, etc.
  2. A black list of words that a quote should not contain: Mac n Cheese, apple, Big Apple, etc.
  3. A white list of speakers, as we also included speakers related to Apple regardless of the aforementioned white-listed words: Steve Jobs, Tim Cook, Steve Wozniak, etc.

This final dataset of Apple-related quotes has been saved in a pkl file for ease of manipulation, and can be accessed with the get_filtered_quotes function.
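The combined filter can be sketched as below. The function name and the exact word lists are illustrative, not the notebook's full lists; note that the blacklist is checked first, so "Big Apple" is rejected even though it contains the whitelisted substring "Apple".

```python
# Hypothetical sketch of the filtering logic (truncated word lists).
WHITELIST = ["Apple", "iPhone", "MacBook"]          # quote must contain one
BLACKLIST = ["Mac n Cheese", "apple", "Big Apple"]  # quote must contain none
SPEAKERS = ["Steve Jobs", "Tim Cook", "Steve Wozniak"]

def is_apple_quote(quote: str, speaker: str) -> bool:
    """Keep a quote if its speaker is tied to Apple, or if it matches the
    whitelist without matching the blacklist."""
    if speaker in SPEAKERS:
        return True
    if any(bad in quote for bad in BLACKLIST):
        return False
    return any(word in quote for word in WHITELIST)
```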

2. First look into the Apple stock market and related quotes

Back to table of contents

One objective of this project is to find patterns and events in both the stock market and media discussion related to Apple.


2.1 Observations on the volatility, liquidity traded and media discussion

Back to table of contents

To do so, we need metrics and qualitative visualizations of the aforementioned data. To find days of interest in the stock market, we use the Liquidity introduced earlier: the mean daily price multiplied by the daily exchange volume. This indicator gives an intuition of the amount of $AAPL stock exchanged in a day. A day of high liquidity is synonymous with a day of high volatility, which may indicate a particular event related to Apple.

In the following plot we highlight, for each year, the top 2% of days by liquidity; in other words, we observe each year the days with the highest liquidity. We observe patterns that will be further studied, such as the September iPhone events or the quarterly reports.
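The yearly top-2% selection can be sketched as follows (hypothetical helper name; it keeps the days above each year's 98th liquidity percentile):

```python
import pandas as pd

# Hypothetical sketch: keep, for each year, the days whose Liquidity lies in
# that year's top 2% (above the yearly 98th percentile).
def top_liquidity_days(df: pd.DataFrame, quantile: float = 0.98) -> pd.DataFrame:
    thresholds = df.groupby(df.index.year)["Liquidity"].transform(
        lambda s: s.quantile(quantile)
    )
    return df[df["Liquidity"] >= thresholds]
```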

Naturally, we complement this view of high-exchange days by looking at the daily price of the $AAPL stock. We highlight the same days of high volatility as in the previous figure, but against the stock price instead of the liquidity. The intuition is that a day of high volatility may have repercussions on the stock price, either as a price fall or as a rebound.

Finally, we take a look at the filtered Quotebank dataset, which only contains Apple-related quotes. In the same vein as the previous plots, we highlight the yearly top 2% of days with the most quotes. Again we observe patterns that will be further dissected in the next few sections: for example, a yearly spike in September when the new iPhone is released, or in June with the annual developer conference. Most importantly, the highest spike, on 6 October 2011, is related to Steve Jobs's death, which was widely covered in the media.

2.2 Finding patterns and studying correlated features

Back to table of contents

From the above observations, we had the intuition that each quarterly report released by Apple was synonymous with a day of high volatility. A quarterly report is a collection of unaudited financial statements, such as gross revenue, net profit, operational expenses, and cash flow. As there are 252 trading days in a year and 4 quarterly reports per year, we expect high-liquidity days with a periodicity of around $252/4 = 63$ days. This is in fact validated by the next cell, which performs a seasonal analysis over a wide range of candidate periods and keeps the periods with the lowest null probability (i.e. the p-value). Later on we will see whether this pattern is also observable in the Apple-related quotes.
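One way to sketch such a seasonal analysis (an assumption about the notebook's exact method) is to correlate the series with a lagged copy of itself for each candidate period and keep the periods with the lowest p-value, breaking ties by correlation strength:

```python
import numpy as np
from scipy import stats

# Hypothetical sketch of the periodicity search: for each candidate period p,
# correlate the series with itself shifted by p days; low p-values (strong
# correlations) indicate a plausible seasonality.
def best_periods(series, periods=range(40, 90), top=3):
    results = []
    for p in periods:
        r, pval = stats.pearsonr(series[:-p], series[p:])
        results.append((p, r, pval))
    # sort by p-value, then by correlation strength to break ties
    return sorted(results, key=lambda t: (t[2], -t[1]))[:top]
```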

What we finally observe is a qualitative correlation between the number of daily quotes related to Apple and the daily liquidity. The two are positively correlated (the Pearson correlation coefficient is approximately 0.3 and the p-value is very small), which indicates that an increase in liquidity is likely to be associated with an increase in discussion related to Apple, and vice versa.
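The computation itself reduces to a correlation between two aligned daily series; a toy sketch with made-up counts and liquidity values:

```python
import pandas as pd

# Toy sketch: daily quote counts aligned with daily liquidity
# (values are made up for illustration).
daily = pd.DataFrame({
    "n_quotes": [12, 40, 25, 60],
    "Liquidity": [1.1e9, 3.0e9, 2.0e9, 4.5e9],
}, index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06", "2015-01-07"]))

r = daily["n_quotes"].corr(daily["Liquidity"])  # Pearson by default
```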

3. Sentiment Analysis of Apple related quotes

Back to table of contents


3.1 Sentiment Prediction

Back to table of contents

One further study that we can perform on the set of Apple-related quotes is a sentiment analysis. Classifying the sentiment of the quotes may help us understand the semantic orientation over a period of time and find events of interest. We will use VADER, a lexicon- and rule-based sentiment analysis tool that can classify each quote as carrying positive, neutral or negative intent. More precisely, this predictive model gives a score in the $[-1,1]$ range, with $1$ being very positive, $-1$ very negative and $0$ neutral. In this section we will look at the correlation between the semantic content of those quotes and the actual stock price.
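The discretisation of VADER's continuous compound score into the three classes can be sketched as follows; the thresholds are the ones conventionally used with VADER, and the scoring call is shown as a comment:

```python
# VADER is available via the `vaderSentiment` package; scoring a quote reads:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(quote)["compound"]
# The compound score lies in [-1, 1]; the thresholds below are the
# conventional VADER cut-offs for the three classes.
def label_sentiment(compound: float) -> str:
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```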

3.2 Sentiment and stock liquidity

Back to table of contents

Furthermore, we can qualitatively estimate the correlation between distinct sentiments and the observed market liquidity. Just as in the previous section, we compute the Pearson correlation between the liquidity and either the positive, negative or neutral quotes. As we found a correlation value of 0.3154 with all quotes regardless of sentiment, we expect a lower score here for negative quotes than for positive or neutral ones.

Looking at the computed correlations, we indeed find a lower correlation score of 0.1653 for negative quotes, compared to 0.2146 and 0.2192 for positive and neutral quotes respectively. As all these values are positive, we may say that regardless of sentiment, an increase in the number of quotations is most likely associated with an increase in liquidity and volatility of the Apple stock. Most importantly, we observe that positive quotes have a greater correlation than negative quotes. One hypothesis is that while negative quotes may make it more likely that a higher-than-usual volume of stock is sold, positive quotes have a higher influence on volatility, and the market may be more willing to buy stock.

4. Impact of the number of pageviews on the speakers' Wikipedia page

Back to table of contents


Previously, we looked at the impact of the valence of the different Apple-related quotes on the stock market. Let's add more depth to our analysis of the media's impact on the stock market: what is the impact of the speaker of a quote? The answer is relatively simple: it depends on how well known the speaker was at the time he or she was quoted in the media. A quite simple way to measure this is the number of pageviews of the speaker's Wikipedia page (if there is one!).

4.1 Wiki labels and wiki ID

Back to table of contents

Before accessing the pageview statistics, we need the exact label of each speaker's Wikipedia page. For this purpose, we load the following dataset, which links each speaker to his or her Wikipedia page.

4.2 Number of wiki page views

Back to table of contents

Next, we need a way to get the annual number of pageviews for each speaker. We use the pageviewapi package, which gives us all the Wikipedia pageviews since 2015. We would have liked to have all the pageviews since 2008, but this proved too complicated: one person had created a way to retrieve these data, but the website no longer works. Thus, we focus our study on 2015 to 2020, and we design a function that returns the number of pageviews for a specific wiki page and year.
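A sketch of such a helper is shown below. It is built on the pageviewapi package (a thin wrapper around the Wikimedia REST pageview API, which only has data from 2015 onwards); the exact call signature is our assumption, and the summing step is factored out so it works on any response of the API's shape:

```python
# Hypothetical sketch of get_pageviews_per_year, assuming the pageviewapi
# package (pip install pageviewapi) and the Wikimedia response format.

def total_views(response: dict) -> int:
    """Sum the view counts in a Wikimedia pageview API response."""
    return sum(item["views"] for item in response.get("items", []))

def get_pageviews_per_year(page: str, year: int) -> int:
    import pageviewapi
    response = pageviewapi.per_article(
        "en.wikipedia", page, f"{year}0101", f"{year}1231",
        granularity="monthly",
    )
    return total_views(response)
```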

4.3 Exact label for each quote

Back to table of contents

The idea now is to get the exact label of each speaker for every quote. However, a quote can have more than one QID: sometimes the name of the speaker of a specific quote can be confused with another speaker, which puts two QIDs in the list for that quote. To deal with this issue, we look at all the different QIDs for each quote, and keep the label corresponding to the speaker with the maximum total number of pageviews.

Before that, we recover the speaker IDs for all quotes that are not anonymous.

Here we can clearly see that one speaker may have more than one QID. Thus, in the following cell, we recover for each speaker the true Wikipedia label, i.e. the one with the highest number of pageviews.

Remark : the following cell shows how the process works on a sample of the full quotes dataframe. We do not run the whole filtering below because it takes about 22 hours; we ran it once on multiple clusters and saved the results in a pickle file.
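The disambiguation step itself can be sketched as follows (hypothetical helper name and toy data):

```python
# Hypothetical sketch: among the candidate (QID, label) pairs of a quote,
# keep the label of the speaker with the most total pageviews.
def resolve_label(candidates, total_pageviews):
    return max(candidates, key=lambda c: total_pageviews.get(c[1], 0))[1]
```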

4.4 Scoring quotes

Back to table of contents

Now we have the exact label of the speaker for every quote. In the following, we want the Wikipedia pageview statistics of every speaker for the year when the quote was published. We take the label and year of each quote and use the function get_pageviews_per_year to add a new score column to our data frame.

Remark : as in the previous subsection, here is an example of how the process works on a small sample of the dataset, because the run over the whole dataset is too long (around 8 hours for this one). Since the running time is so long, we split the steps so that we could run them on multiple computers. Eventually, we obtained the number of pageviews of each speaker for every year between 2015 and 2020. Here is an example of how the code works.

Now, we directly join the year of each quote with the corresponding number of speaker pageviews. After that we normalize the value and obtain a score for each quote (in absolute value for now).
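The join and normalisation can be sketched as below; the toy values and the choice of max-normalisation are our assumptions for illustration:

```python
import pandas as pd

# Hypothetical sketch of the join and normalisation (toy values).
quotes = pd.DataFrame({"label": ["Tim Cook", "Steve Wozniak"], "year": [2016, 2016]})
views = pd.DataFrame({"label": ["Tim Cook", "Steve Wozniak"],
                      "year": [2016, 2016],
                      "pageviews": [900_000, 100_000]})

df = quotes.merge(views, on=["label", "year"], how="left")
df["score"] = df["pageviews"] / df["pageviews"].max()  # normalise to [0, 1]
```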

Remark : this step does not use the pageviewapi package, so it is fast enough to run directly on the whole data frame.

4.5 Get back to sentiment analysis

Back to table of contents

We now have an absolute score for every single quote, representing the notoriety of the speaker. Can we combine this information with the per-quote sentiment analysis of the previous section? We will reuse the function designed in section 3 to get the valence of each quote (+1 for positive, -1 for negative and 0 for neutral). In the following cell, we show how a sentiment column is added to our dataframe.

4.6 Final scoring

Back to table of contents

Finally, we multiply the valence of each quote with its absolute score to get positive and negative scores. We first create two columns, negative_score and positive_score, for each quote.

Example : if a quote has a negative valence, the negative_score column will contain the valence (-1) multiplied by the absolute score, and the positive_score column will be zero.

Then, we sum the scores for each day. The idea is to identify the days on which many famous people talked about Apple positively or negatively. We keep only the date, positive_score and negative_score columns for the final plot. The following cell shows how the process is applied.
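The scoring and daily aggregation described above can be sketched as follows (toy values; the splitting into positive/negative columns via clipping is our assumption about the implementation):

```python
import pandas as pd

# Hypothetical sketch of the final scoring and daily aggregation (toy values).
df = pd.DataFrame({
    "date": ["2016-02-17", "2016-02-17", "2016-02-18"],
    "sentiment": [1, -1, 1],      # valence: +1 / -1 / 0
    "score": [0.8, 0.5, 0.2],     # normalised notoriety of the speaker
})

signed = df["sentiment"] * df["score"]
df["positive_score"] = signed.clip(lower=0)   # keep only positive part
df["negative_score"] = signed.clip(upper=0)   # keep only negative part

daily = df.groupby("date")[["positive_score", "negative_score"]].sum()
```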

4.7 Final plot

Back to table of contents

After all these steps, we can finally obtain a concrete visualization of our results. We plot positive_score and negative_score as a function of the date and, on the same figure, the Apple stock price. The goal is to see whether there is a visible correlation between the positive and negative scores and the stock price. Here is the resulting plot.

4.8 Distribution of the quotes according to their valence and to the fame of the speaker

Back to table of contents

In the following plot, we visualize the distribution of the quotes according to their valence and to the fame of the speaker. We have chosen 6 major events that were highly publicized between 2015 and 2020, picked from the top 2% in the previous plot and for which we were able to identify a relevant associated theme, e.g. the "FBI-Apple encryption dispute", the "Release of the iPhone X", etc.


5. Building a model for stock market prediction

Back to table of contents

Using the Apple-related quotes, the speaker data and past stock performance, we can make a first attempt at predicting the daily stock price and liquidity. For this section we will use Facebook's Prophet library, which provides powerful and easy-to-use forecasting tools. At its core, the model is a modular linear regression model that can take into account past performance and additional factors.
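Prophet expects a dataframe with a `ds` (date) column and a `y` (target) column, with extra regressors such as our sentiment score supplied as additional columns. A sketch of the preparation step (the helper name is ours; the fit itself is shown as a comment since it requires Prophet to be installed):

```python
import pandas as pd

# Hypothetical sketch: reshape a stock frame into Prophet's expected format.
def to_prophet_frame(stock: pd.DataFrame) -> pd.DataFrame:
    out = stock.reset_index().rename(columns={"Date": "ds", "Close": "y"})
    return out[["ds", "y", "score"]]

# The fitting itself would then read (assuming prophet is installed):
#   from prophet import Prophet
#   m = Prophet()
#   m.add_regressor("score")   # sentiment/notoriety score as extra regressor
#   m.fit(to_prophet_frame(stock))
```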

Remark: Prophet uses a Python translation layer over Stan, which has an unresolved issue of printing all the Stan computation iterations; in the current state of the library there is no way to hide this output.

5.1 Model fitting and observations

Back to table of contents

In this final section we will attempt to train a time-series forecasting model using the past stock price and the score obtained in the previous section. We recall that this sentiment score is obtained from the number of Wikipedia views of the speaker and the sentiment of their quote. As the Wikipedia data only goes back to 2015, we limit this section accordingly. Furthermore, we observed in 2020 a stock split, which is when a company divides the price of a stock but multiplies the number of available shares by the same factor. This sudden price fall is unpredictable and does not indicate a pullback from the overall market. This unusual data is why we further restrict the training range to 2015-2017. Finally, we fit the model on 2015-2017 and observe its prediction for 2018, compared to the actual 2018 price.

In the graph below, we can observe the fitted model on the 2015-2017 range and its prediction for 2018. First, we observe that the true stock price in the first quarter of 2018 is quite a bit below the predicted lower bound. However, the model successfully predicts a price fall in July 2018, and from there the true price comes back into the prediction range. This first attempt at a model provides interesting insights into the periodicity of the data and shows that the additional feature indeed contributes to the model.

5.2 Model validation

Back to table of contents

To validate the parameters and formulation of our model, we perform time-series cross-validation. The idea is to start with a model fitted on the first 150 days, predict the stock price over the following 60 days and evaluate the error. Once this first fold is done, we extend the training window by the next 30 days and predict again on the following 60 days, and so on. This is illustrated in the figure below.

We evaluate the mean absolute percentage error (MAPE), which gives the average percentage error one can expect when using the model to predict the stock price: $$\mathrm{MAPE} =\frac{100 \%}{n} \sum_{t=1}^{n}\left|\frac{A_{t}-F_{t}}{A_{t}}\right| $$ where $A_t$ is the actual price and $F_t$ the forecast.
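A direct implementation of this formula (as a fraction rather than a percentage) is:

```python
import numpy as np

# MAPE as a fraction: 0.04226 corresponds to 4.226%.
def mape(actual: np.ndarray, forecast: np.ndarray) -> float:
    return float(np.mean(np.abs((actual - forecast) / actual)))
```

Prophet also automates the whole procedure via `prophet.diagnostics.cross_validation(m, initial="150 days", period="30 days", horizon="60 days")` followed by `performance_metrics`, which reports MAPE among other metrics.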

We observed a MAPE of 0.04226, which means that with this model one should expect predictions to differ by 4.226% on average from the actual price.

5.3 Fine-tuning the parameters

Back to table of contents

Finally, we can use this cross-validation technique to find the hyperparameters that are expected to provide the best model on a test set. We attempt to tune two arguments of the Prophet model:

  1. changepoint_prior_scale: this parameter determines the flexibility of the trend, in particular how much the trend changes at the trend changepoints.
  2. seasonality_prior_scale: this parameter controls the flexibility of the seasonality. A large value allows the seasonality to fit large fluctuations, while a small value shrinks its magnitude.
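The tuning loop can be sketched as a grid search over these two parameters; the value grids below are the ranges suggested in the Prophet documentation, and `train_df` is a hypothetical training frame:

```python
import itertools

# Candidate values (ranges suggested in the Prophet documentation).
param_grid = {
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    "seasonality_prior_scale": [0.01, 0.1, 1.0, 10.0],
}
# Enumerate every combination of the two parameters.
all_params = [dict(zip(param_grid, combo))
              for combo in itertools.product(*param_grid.values())]

# For each candidate (sketch, assuming prophet is installed):
#   for params in all_params:
#       m = Prophet(**params).fit(train_df)
#       ... run the cross-validation of section 5.2 and keep the
#       parameters with the lowest MAPE
```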